includes a putative gene in said DNA sequence", as recited, for example, in claim 1, and 
similarly recited in claims 16, 23 and 29. 

Applicant respectfully submits that the Examiner appears to be confused with a passage 
in Rigoutsos '99 at page 228, left col., first paragraph: 

"[o]ne of the seqlets that are discovered when we process the input database is [...] 
and is present in the following ten ORFs: [...] Of these, gi_3328856 and 
gi_3329230 are annotated as Fe-S oxidoreductases." 

Applicant submits that the Examiner is likely misinterpreting this passage to mean 
that Rigoutsos '99 has taught how to annotate these two sequences and that Rigoutsos '99 claims that 
these two sequences are Fe-S oxidoreductases. That is, the Examiner appears to be interpreting 
this phrase as "are annotated by Rigoutsos '99". This is clearly not correct. 

In fact, nowhere does Rigoutsos '99 teach or suggest annotating seqlets as Fe-S 
oxidoreductases. Applicant respectfully submits that this simple passage could not possibly be 
considered to "teach" or "suggest" annotating these seqlets. Certainly, one of ordinary skill in the 
art could not read Rigoutsos '99 and know how to annotate the seqlets as Fe-S 
oxidoreductases. Indeed, nowhere does Rigoutsos '99 teach any method, process or criteria for 
annotating these seqlets. 

In fact, this passage is simply indicating that some third party has annotated these 
seqlets in the public databases as Fe-S oxidoreductases. Indeed, the precise third party 
responsible for annotating these seqlets can be identified if one goes to the public records for 
these sequences and tracks down the bibliography related to these seqlets. 

Thus, this passage should be interpreted as "are annotated by third parties in the public 
databases as Fe-S oxidoreductases." 

Moreover, even assuming (arguendo) that this passage teaches annotating the seqlets as 
Fe-S oxidoreductases, Applicant would point out that it would be the kind of "annotation" that 
claims a "functional behavior", and this kind of annotation has nothing whatsoever to do with 
identifying genes (e.g., "annotating a DNA sequence by claiming that it actually codes for a 
gene"), which is a purpose of the claimed invention. 

Clearly, in the field of the claimed invention, the verb "annotate" is overloaded in terms 
of what it means. Thus, even if two people of ordinary skill in the art having a conversation 
agree to interpret "annotate" to mean "annotate a protein sequence by stating its putative or 
validated functional behavior", there is still ambiguity. In fact, this "ambiguity" is discussed in 
the passage indicated on page 3912 of the attached paper Rigoutsos et al., "Dictionary-driven 
protein annotation", Nucleic Acids Research, 2002, Vol. 30, No. 17, pp. 3901-3916. 

Thus, Applicant again submits that neither Rigoutsos '99, nor Delcher, nor any alleged 
combination thereof teaches or suggests a processor which translates an open reading frame 
(ORF) of a DNA sequence into an amino acid translation, and locates in the amino acid 
translation occurrences of patterns from a pattern database to determine whether the open reading 
frame includes a putative gene in the DNA sequence. 



FORMAL MATTERS AND CONCLUSION 



In view of the foregoing, Applicant submits that claims 1-30, all the claims presently 
pending in the application, are patentably distinct over the prior art of record and are in condition 
for allowance. The Examiner is respectfully requested to pass the above application to issue at 
the earliest possible time. 

Should the Examiner find the application to be other than in condition for allowance, the 
Examiner is requested to contact the undersigned at the local telephone number listed below to 
discuss any other changes deemed necessary in a telephonic or personal interview. 
[...] deficiency in fees or to credit any overpayment in fees to Assignee's Deposit 
Account No. 50-0510. 

Date: 

ABSTRACT 

Computational methods seeking to automatically determine the properties (functional, 
structural, physicochemical, etc.) of a protein directly from the sequence have long been 
the focus of numerous research groups. With the advent of advanced sequencing methods 
and systems, the number of amino acid sequences that are being deposited in the public 
databases has been increasing steadily. This has in turn generated a renewed demand for 
automated approaches that can annotate individual sequences and complete genomes 
quickly, exhaustively and objectively. In this paper, we present one such approach that is 
centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that 
completely covers the natural sequence space and can capture functional and structural 
signals that have been reused during evolution, within and across protein families. Our 
annotation approach also makes use of a weighted, position-specific scoring scheme that 
is unaffected by the over-representation of well-conserved proteins and protein fragments 
in the databases used. For a given query sequence, the method permits one to determine, 
in a single pass, the following: local and global similarities between the query and any 
protein already present in a public database; the likeness of the query to all available 
archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid 
position within the query; the character of secondary structure of the query as a function 
of amino acid position within the query; the cytoplasmic, transmembrane or extracellular 
behavior of the query; the nature and position of binding domains, active sites, 
post-translationally modified sites, signal peptides, etc. In terms of performance, the 
proposed method is exhaustive, objective and allows for the rapid annotation of individual 
sequences and full genomes. Annotation examples are presented and discussed in Results, 
including individual queries and complete genomes that were released publicly after we 
built the Bio-Dictionary that is used in our experiments. Finally, we have computed the 
annotations of more than 70 complete genomes and made them available on the World 
Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. 


INTRODUCTION 



The automatic elucidation of protein function directly from sequence has been the focus 
of research activity for many years. Such an elucidation has an obvious appeal for it tries 
to minimize the amount of associated manual labor by reducing a large number of 
possibilities to one or a handful of choices. This is typically achieved by tapping into 
repositories of previously accumulated knowledge and by trading computation (i.e. in 
silico approaches) for typically tedious manual analysis. The discovery of protein function 
directly from sequence, in an automated or semi-automated manner, has become a 
fundamental question as thousands of unknown proteins and increasing numbers of 
complete genomes are made available daily in the public domain. Of course, one should 
not lose sight of the fact that protein annotation is the first step in the attempt to fully 
describe a particular organism through characterization of its metabolic pathways and 
transcription regulation networks. 

During the past three decades, numerous methods have been proposed for determining 
protein function from sequence, all of which are essentially instances of a 'guilty by 
association' approach to solving this problem. Depending on the nature of the information 
exploited and the manner in which the information is used, these methods can essentially 
be divided into a handful of well-differentiated categories. 

The chronologically earliest examples of protein annotation methods rely on the 
determination of a local or global similarity between a query protein and proteins with 
known annotation that are contained in a database (1-4). If two sequences of comparable 
length share a large portion of their extent, the previously uncharacterized sequence will 
inherit the function of the annotated one. The validity of this scheme relies on the implicit 
assumption that two sequences that 'look the same' at the sequence level also have the 
same function and structure. This is a reasonable assumption, but counter-examples such 
as the dehydrogenase/ζ-crystallin case have also been documented in the literature (for a 
discussion of this particular case see for example 5). The methods in this category are 
known as 'similarity-based' or 'homology-based' and are numerous. The approach we 
present in this paper belongs to this category as well. 



*To whom correspondence should be addressed. Tel: +1 914 945 1384; Fax: +1 914 945 4104; Email: rigoutso@us.ibm.com 

Figure 1. An example of incorrect protein annotation as a result of multiple domain sharing. The sequences have been aligned in a manner that shows their 
common domains. The 'D-arabino 3-hexulose 6-phosphate formaldehyde lyase' function has been propagated from the [illegible] sequence to the rest 
of the sequences in the group. Instances of the same domain have been shown in the sequences that contain it using the same [illegible] shading. 



A recurrent situation within the similarity-method category 
that is pertinent to our discussion relates to the 
inclination of annotators to use either the first or the best 
'hit' from the output of a database search that is carried 
out by one of the similarity search algorithms such as 
FASTA (3), BLAST (4), Smith-Waterman (2), etc. In the 
presence of domains that are shared by numerous proteins 
(6), choosing the first or the best hit may not be optimal. 
As a matter of fact, the multi-domain organization of 
proteins can lead to incorrectly annotated database entries 
(Fig. 1 shows a characteristic example of such a 
misannotation). Use of a domain scan and the 
exploitation/analysis of the generated output when annotating a 
query can substantially improve the results. Such a domain 
scan can be implemented, for example, with the help of 
the PROSITE, PRINTS, PFAM, BLOCKS or PRODOM 
databases (7-11). For a description of other sources of 
potential concern in protein annotation and some recom- 
mended solutions the reader is referred to previous 
publications (12-16). 

A second category of methods has become known as 
the Rosetta stone approach (also known as the domain 
fusion method) (17-19). Here, one seeks to determine 
groups of proteins that are distinct in a given organism 
but appear as a single product in another organism, the 
result of an assumed fusion event. The distinct proteins in 
the original group are assumed to be physically interacting 
and, depending on the specifics of each case, this 
information can be helpful in determining protein function. 

The methods in the third category seek to determine 
groups of proteins that repeatedly appear as neighbors of 
one another in the chromosomes of different organisms. 
The proteins involved are assumed to have a functional 
relationship (this methodology is similar in flavor to the 
Rosetta stone approach but distinct with respect to the 
type of information that it uses). Exploitation of this idea 
has found a best fit in the case of prokaryotic genomes 
where proximal gene organization is manifested in the 
form of operons, and it has been used successfully to 
guide functional annotation (20). It is not evident, 
however, whether this idea carries over to eukaryotic 
organisms due to the fact that, in general, the latter lack 
operons. A closely related variation, which does extend to 
eukaryotic organisms, operates on the assumption that if 
an organism is in need of a specific pathway then the 
organism will carry all or most of the genes comprising 
the pathway, and vice versa. For example, the work of 
Marcotte (17) and similar work done by others attempt to 
define function in terms of the pathways and complexes in 
which the protein participates, rather than suggest a 
specific biochemical activity; in this framework a protein 
is associated with a function via its linkages to other 
proteins. 

Finally, in recent years, a fourth category has emerged. 
Here, one tackles the problem of protein function elucidation 
through the analysis of correlated mRNA expression 
of the type that is encountered in the context of 
DNA- and microarray-chip experiments. The underlying 
assumption is that functionally related proteins will exhibit 
correlated mRNA expression levels under multiple experimental 
settings. Consistent participation of a previously 
uncharacterized protein in clusters comprising proteins 
with a well-understood function imposes constraints on the 
unknown protein's possible behavior by restricting its 
candidate memberships within the context of a metabolic 
pathway (21). In principle, this method can help resolve a 
protein's function. A more recent variation of this general 
approach measures levels of protein expression (instead of 
mRNA) with the help of mass spectrometry, or 2D gel 
electrophoresis in an attempt to determine clusters of 
strongly co-expressed proteins: these clusters can be used 
to determine the function of uncharacterized proteins. 



We next present and discuss a new approach to the 
problem of protein annotation. At the center of our 
approach is the Bio-Dictionary, an exhaustive collection of 
amino acid patterns, heretofore referred to as seqlets, that 
completely covers the natural sequence space of proteins 
to the extent that the latter is sampled by the currently 
available biological sequences. In earlier studies, we 
showed that the seqlets can capture both functional and 
structural signals that have been reused during evolution 
within and across families of related proteins. Our 
approach relies on the seqlets contained in the Bio-Dictionary 
and the associated information that is available 
in a well-maintained database such as SwissProt/TrEMBL 
(22), derives from an earlier prototype system we built to 
carry out similarity searching (23,24) and employs a 
weighted, position-specific scoring scheme that is not 
affected by the over-representation of well-conserved 
proteins and protein fragments that are present in the 
public databases. Although similar in intent to systems like 
GeneQuiz (25), our method goes beyond simply stating 
the presence of local and global similarities between a 
query and one or more database sequences: in fact, we 
also report information about the secondary structure 
characteristics of the query, the presence of known 
domains, signal peptides, active sites, post-translationally 
modified sites, cytoplasmic/extracellular behavior, the 
similarity of the query to each of the three phylogenetic 
domains as a function of amino acid position, etc. 



MATERIALS AND METHODS 

Background 

The Bio-Dictionary idea was introduced and discussed in 
earlier works (26-28); therein we described how one can use 
the Teiresias pattern discovery algorithm (29,30) to process a 
very large public database of amino acid sequences and 
fragments and derive a collection of amino acid patterns that, 
by design, appear within as well as across family boundaries. 
These patterns are referred to as seqlets and have been shown 
to capture functional and structural signals; moreover, they 
have the very important property of completely describing the 
processed input at the amino acid level. Following are some 
seqlet examples that include the name of the represented 
feature or of the represented protein family, taken from 
Rigoutsos et al. (26): GDG[IVAMTD]ND[AILV][PEAS] 
[AMV][LMIF]..A (cation-transporting ATPases), V.I.O- 
G..G...A (NAD/FAD-binding flavoproteins), G..G.GK[ST]TL 
(ATP/GTP-binding P-loop), KMSKS[LKDIR][GNDFQ]N 
(class I aminoacyl-tRNA synthetases), H.....HRD.K-N 
(serine/threonine protein kinases), etc. In terms of the notation 
used, [LKDIR] means a choice of exactly one among L, K, D, 
I and R, whereas '.' denotes a single-position wild-card 
character that can replace any of the symbols in the alphabet. 
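
The bracket and dot notation described above happens to coincide with regular-expression syntax, so a seqlet can be matched against a sequence with very little machinery. The following is only a minimal sketch under that assumption: the function name `find_seqlet`, the restriction to the 20 standard one-letter residue codes, and the test sequence are all hypothetical and are not taken from the paper or from Teiresias.

```python
import re
from typing import List, Tuple

# The 20 standard amino acid one-letter codes (an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def find_seqlet(seqlet: str, sequence: str) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) spans of every match of a seqlet.

    '[ST]' means exactly one of S or T and '.' is a single-position wild-card,
    which is also what these symbols mean in a regular expression.
    """
    if not seqlet or not set(seqlet) <= set(AMINO_ACIDS + "[]."):
        raise ValueError(f"unsupported character in seqlet: {seqlet!r}")
    # A capturing group inside a lookahead reports overlapping matches too.
    pattern = re.compile("(?=(" + seqlet + "))")
    return [(m.start(1), m.end(1)) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence containing one instance of the P-loop seqlet.
    print(find_seqlet("G..G.GK[ST]TL", "MAGSSGAGKSTLLR"))  # [(2, 12)]
```

The lookahead is used so that overlapping instances of the same seqlet are all reported, which matters because seqlet instances are allowed to overlap.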

The derived collection of seqlets can thus be treated as a 
(redundant) vocabulary for the natural sequence space of 
proteins to the extent that the latter is uniformly sampled by 
the currently available biological sequences. Associating the 
entries of this vocabulary-like collection with the annotation 
information contained in a typical entry of the SwissProt 
database allows us to convert the collection into a dictionary. 
We have been using the term Bio-Dictionary to refer to the 
collection of seqlets that has been augmented so as to include 
the 'annotation meaning' for each of the entries. The key 
elements behind the Bio-Dictionary, and details of its 
construction, can be found in Rigoutsos et al. (26); an analysis 
of the 3D structural properties associated with the seqlets of a 
dictionary built out of 17 complete archaeal and bacterial 
genomes is given in Rigoutsos et al. (27); finally, a 
discussion and description of potential uses for it appears in 
Rigoutsos et al. (28). In more recent work, we applied the 
Bio-Dictionary to in silico prokaryotic gene finding and built a 
system with exceptional performance (31): unlike approaches 
that are based on Markov models where each distinct genome 
requires that a different model be built, our gene finding 
system is universal in that a single instance of it is used across 
all archaeal and bacterial genomes. 

The earlier work 

By carrying out pattern discovery on a given sequence 
database D, we can use the generated pattern collection C to 
carry out similarity searches for instances of a query or its 
fragments in D as follows: a pattern from the derived 
collection C of patterns that matches a region of the query 
under consideration pinpoints a potential local similarity 
between the matched region of the query and all of the 
sequence fragments from the input database that the pattern 
represents (recall that by the definition of pattern discovery, 
patterns appear k or more times in the processed input, with 
k ≥ 2). 

Nucleic Acids Research, 2002, Vol. 30, No. 17 

In earlier work, we used the Teiresias algorithm to process 
Release 34 of the SwissProt database and built a prototype 
system for similarity searching using only a subset C' from the 
derived collection C of patterns. A given query sequence was 
examined for matches of patterns contained in C' and the 
query and database regions corresponding to the matches were 
aligned, scored, and finally sorted according to the computed 
score. Following the sorting, one could proceed in one of two 
distinct ways: (i) the user was presented with the collection of 
patterns that matched the query and was asked to identify 
those of biological importance, then alignments were generated 
using those patterns alone; (ii) those alignments that 
resulted from patterns whose database instances carried an 
associated annotation (namely the FT line) were reported to 
the user for further study (23,24). 

This early system was meant to be a proof of concept. 
Consequently, complete coverage of the input database by the 
patterns in the collection C was neither achieved nor pursued. 
As a matter of fact, this early system used a mere 565 432 
patterns which covered ~50% of the processed SwissProt 
database at the amino acid level. Neither the existing 
over-representation of various protein families in the database nor 
the system's real-time performance were design considerations 
at the time. However, this early excursion provided an 
invaluable learning experience that helped guide us toward the 
system which we present next. 

The method: description 

The first and foremost consideration of the new approach is the 
achievement of a complete coverage of the natural sequence 
space as it is currently known. To that end we used as our 
domain of knowledge the 14 May 2001 release of SwissProt/ 
TrEMBL, a large, publicly available and curated database 
(22). This particular release comprised 532 621 amino acid 
sequences and fragments with a grand total of 170 762 058 
amino acids. 

We processed this input database in two phases. First, we 
ran Teiresias using L = 8, W = 8 and K = 2 and generated 
variable length seqlets that contained no wild-cards. For each 
one of these seqlets, we located and masked all of its instances 
in the input database except for the one that appeared in the 
longest among the sequences containing instances of the 
seqlet. We then reran Teiresias on the masked input, but this 
time using L = 6 and W = 15. For more information about this 
scheme and other methodological details the reader is referred 
to Rigoutsos et al. (26). The processing required ~45 CPU 
days worth of computation using IBM RS64 III processors 
with a clock speed of 450 MHz. With the help of a parallel 
implementation of Teiresias that we have developed for shared 
memory architectures, we completed this computation in 
2 days on a 24 processor IBM S-80 supercomputer. 

The two pattern discovery phases generated the Bio-Dictionary 
that we used in our analysis and which contained 
a combined total of 42 996 454 seqlets [compare the size of the 
current collection with the 565 432 patterns used in Floratos 
et al. (23,24)] that accounted for 98.2% of the processed input 
at the amino acid level (this degree of coverage essentially 
implies that, on average, a mere five amino acids per protein 
sequence cannot be accounted for by this seqlet collection). 
The length and density distributions of these seqlets match 
closely the ones shown in Rigoutsos et al. (26), whereas the 
average length of a seqlet is ~12-13 amino acids. It should be 
noted that this Bio-Dictionary contains redundant seqlets, i.e. 
a given amino acid position in the processed input will 
typically participate in and is covered by multiple seqlets; this 
redundancy of representation is a desired property which we 
exploit during annotation. 
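
The coverage and redundancy statements above can be checked with back-of-the-envelope arithmetic on the figures quoted in the text. The sketch below is only illustrative: the function names are hypothetical, and the mean seqlet length of 12.5 is an assumed midpoint of the quoted ~12-13 range. It suggests that the uncovered residue count is a handful per sequence, and that the product of seqlet count and mean seqlet length is roughly three times the size of the input, a crude indicator of the redundancy mentioned in the text.

```python
# Figures quoted in the text for the 14 May 2001 SwissProt/TrEMBL release.
N_SEQUENCES = 532_621
N_RESIDUES = 170_762_058
COVERAGE = 0.982            # fraction of residues covered by seqlets
N_SEQLETS = 42_996_454
MEAN_SEQLET_LEN = 12.5      # assumption: midpoint of the quoted ~12-13


def uncovered_residues_per_sequence(n_residues: int, n_sequences: int,
                                    coverage: float) -> float:
    """Average number of residues per sequence left uncovered."""
    mean_sequence_len = n_residues / n_sequences
    return (1.0 - coverage) * mean_sequence_len


def redundancy_ratio(n_seqlets: int, mean_seqlet_len: float,
                     n_residues: int) -> float:
    """(number of seqlets x mean seqlet length) relative to the input size."""
    return n_seqlets * mean_seqlet_len / n_residues


if __name__ == "__main__":
    print(round(uncovered_residues_per_sequence(N_RESIDUES, N_SEQUENCES, COVERAGE), 1))
    print(round(redundancy_ratio(N_SEQLETS, MEAN_SEQLET_LEN, N_RESIDUES), 1))
```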

Each seqlet in the collection carries along the meaning(s) 
associated with the regions of the proteins that contained an 
instance of and gave rise to the seqlet. Instead of a static 
description of each seqlet's meaning(s) in the manner that we 
described in Rigoutsos et al. (28), we opted for composing the 
full entry of each seqlet during the run time as required. We 
currently derive labels for meanings from only the DE, OC 
and FT lines of the respective SwissProt/TrEMBL entry; 
obviously, we can tap into any database containing complementary 
information and attach additional meanings to each 
seqlet. One obvious choice is the PDB (32,33): in previous 
publications (27,28) we described how 3D structure can be 
associated with seqlets and are currently in the process of 
extending the approach presented herein in order to reconstruct 
local 3D structure using the structural hypotheses 
generated by partially overlapping seqlets (D.Platt, 
I.Rigoutsos, Y.Gao and L.Parida, submitted for publication). 

Recall that the DE or 'description' line of SwissProt 
contains general descriptive information about the respective 
entry. Similarly, the OC or 'organism classification' line 
contains the taxonomic classification of the source organism. 
And the FT or 'feature table' line contains a simple and precise 
means for the annotation of the sequence data; a fixed 
abbreviation with a defined meaning relating to the feature 
being reported is followed by the residue numbers indicating 
the end points (extent) of the named feature; the line ends with 
a description containing additional information about the 
feature. Of the available SwissProt/TrEMBL abbreviations 
contained in an FT line we only make use of the ones listed in 
Table 1. 
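
As an illustration of the FT line structure just described (a key, two residue numbers giving the extent, then a free-text description), the sketch below parses one such line and computes its overlap with a seqlet instance. It is a simplified sketch and not the parser used in this work: it assumes whitespace-separated fields and purely numeric end points, and the names `Feature`, `parse_ft_line` and `overlap`, as well as the example line, are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    key: str          # feature label, lower-cased
    start: int        # first residue of the feature (1-based, inclusive)
    end: int          # last residue of the feature (1-based, inclusive)
    description: str  # free text following the end points


def parse_ft_line(line: str) -> Feature:
    """Parse a line of the assumed form 'FT  KEY  FROM  TO  DESCRIPTION'."""
    parts = line.split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError(f"not a feature table line: {line!r}")
    description = parts[4].strip() if len(parts) > 4 else ""
    return Feature(parts[1].lower(), int(parts[2]), int(parts[3]), description)


def overlap(a: Tuple[int, int], b: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """Intersection of two inclusive intervals, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


if __name__ == "__main__":
    feature = parse_ft_line("FT   METAL        42     42       Zinc (made-up example)")
    print(feature)
    print(overlap((feature.start, feature.end), (35, 47)))
```

Real entries can carry uncertain end points (for example, with a leading `<` or `?`), which this sketch deliberately does not handle.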

When presented with a query Q to annotate, we carry out the 
steps outlined in Figure 2, a markedly different approach than 
the one used in our early prototype. First, we generate the 
'intersection' of the Bio-Dictionary with the query sequence to 
find those seqlets that match somewhere in the query. For 
each of the seqlets in this intersection, we examine the 
corresponding SwissProt/TrEMBL entries for all of the 
sequences that gave rise to the seqlet during the 
Bio-Dictionary formation, thus building the corresponding 
dictionary entry 'on-the-fly' by dynamically attaching to the seqlet all 
the meaning(s) extracted from those entries. The extracted 
meanings essentially 'color' each seqlet and by extension the 
region of the query where the seqlet matches. Note that a given 
seqlet can carry multiple 'colors', i.e. attributes. 
Consequently, a region of the query can be associated with 
multiple attributes. If the seqlet under consideration is 
attached to an attribute that has not been encountered before, 
then a new attribute vector is introduced: the attribute vector 
has the same length as the query and initially contains zeroes 
everywhere; the current seqlet assigns its contribution 
CONTRIB(...) to this new attribute vector at precisely the 
region corresponding to the seqlet's match in the query; if the 



Table 1. FT line labels used in our work (see also text) 

mod_res    carbohyd   propep     dna_bind 
helix      lipid      metal      chain 
site       strand     disulfid   binding 
peptide    transmem   turn       thioeth 
transit    ca_bind    zn_fing    non_cons 
se_cys     thiolest   signal     domain 
similar    non_ter 

seqlet carries an attribute that has been encountered before, the 
seqlet adds its contribution CONTRIB(...) to the appropriate 
region of the already existing attribute vector. Multiple seqlets 
that carry the same attribute will add their individual 
contributions to the attribute's vector; the regions to which 
the seqlets contribute may or may not be overlapping. The 
manner in which we decide what amount a seqlet will 
contribute to an attribute vector is described in detail below. 

After all seqlets in the intersection have been exhausted, and 
separately for each attribute category (e.g. DE, FT, etc.), the 
attribute vectors are sorted and ranked based on the accumulated 
support and the top T ranking vectors of the category are 
reported (T is a user-specified parameter). Each of these 
vectors will contain non-zero values at precisely those regions 
that were matched by possibly overlapping, distinct seqlets 
that carried the same attribute. 
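
The accumulation and ranking just described can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: each contribution value is supplied by the caller (the actual CONTRIB scheme is described separately), 'accumulated support' is assumed here to be the sum of a vector's entries, intervals are taken as 0-based and inclusive, and the names `accumulate` and `top_ranked` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One seqlet contribution: (q_from, q_to, category, attribute, contribution),
# with q_from..q_to a 0-based inclusive interval of the query.
Hit = Tuple[int, int, str, str, float]
Vectors = Dict[Tuple[str, str], List[float]]


def accumulate(query_len: int, hits: Iterable[Hit]) -> Vectors:
    """Build one score vector per (category, attribute) pair."""
    vectors: Vectors = {}
    for q_from, q_to, category, attribute, contribution in hits:
        key = (category, attribute)
        if key not in vectors:
            # First time this attribute is seen: start from all zeroes.
            vectors[key] = [0.0] * query_len
        vector = vectors[key]
        for position in range(q_from, q_to + 1):
            vector[position] += contribution
    return vectors


def top_ranked(vectors: Vectors, category: str, t: int) -> List[str]:
    """Return the t attributes of a category with the largest total support."""
    scored = [(sum(vector), attribute)
              for (cat, attribute), vector in vectors.items() if cat == category]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [attribute for _, attribute in scored[:t]]


if __name__ == "__main__":
    hits = [(0, 3, "FT", "metal", 1.0), (2, 5, "FT", "metal", 0.5),
            (1, 2, "FT", "helix", 0.25), (0, 1, "DE", "kinase", 1.0)]
    vectors = accumulate(8, hits)
    print(vectors[("FT", "metal")])
    print(top_ranked(vectors, "FT", 1))
```

Keeping one vector per (category, attribute) pair mirrors the text: overlapping seqlets with the same attribute reinforce each other position by position, and ranking is done separately within each category.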

Before we describe the scoring scheme it is important to 
stress some points that are particular to our work. In general, 
the Bio-Dictionary should not be seen as a collection of seqlets 
each of which necessarily captures a specific feature such as a 
kinase domain, a metal binding site, etc. Seqlets that can act as 
predicates for a feature or protein family do exist in the 
Bio-Dictionary, but, by design, seqlets may also carry multiple 
meanings. This is different from the one-to-one correspondence 
that the reader may be accustomed to and which is typical 
of predicate-containing databases such as PROSITE, PRINTS 
and [illegible] (7,8,34). 

As we showcased previously (26), a seqlet can cross 
functional and structural boundaries and can thus be associated 
with multiple meanings. Clearly, those of the seqlets that 
are associated with a unique meaning can function as 
predicates, but a significant number of them will capture and 
correspond to multiple meanings. 

Similarly, the Bio-Dictionary also contains multiple seqlets 
all of which capture the same meaning. These seqlets can also 
have instances that overlap with one another, as indicated by 
the fact that the product of the number of seqlets contained in 
the Bio-Dictionary times the seqlets' average length is a 
multiple of the actual length of the processed input (26). 

Thus, by design, a given position of a processed query will 
in general be covered by multiple seqlets. Each of the seqlets 
covering a position within the query will in general carry one 
or more meanings that are used to 'color' the corresponding 
region of the query. Let a given query position be covered by 
M distinct seqlets. In order for an attribute, e.g. 'metal-binding 
site', to rank high in the reported results, a large portion of 
those M seqlets must carry this attribute. 

Recall that, by definition, each of the seqlets of the 
Bio-Dictionary appears in at least two places in the processed 
database (SwissProt/TrEMBL in our case): thus, if M seqlets 
cover a given position in the query to annotate, then the 
following two properties will simultaneously hold: 
1) determine the subset S of seqlets in the Bio-Dictionary that match regions 
   in the query Q with length |Q| ; 

2) for each seqlet s in S do { 

   2a) let q_from and q_to denote the region in the query matched by s ; 
   2b) use the Bio-Dictionary information to access all instances of seqlet s 
       in the SwissProt/TrEMBL database and let P denote the set of 
       corresponding SwissProt/TrEMBL entries ; 

   2c) for each SwissProt/TrEMBL entry p in P { 

       - let [p_from, p_to] denote the instance of seqlet s in the SwissProt/TrEMBL 
         entry p under consideration ; 
       - retrieve full SwissProt/TrEMBL record R for the respective entry p ; 

       - retrieve organism classification OC_p from the record R for p ; 
       - if (OC_p has not been encountered before) { 
            - create a one-dimensional score array with length |Q| ; 
            - initialize the array to all 0's and set OC_p as its attribute ; 
            - assign CONTRIB([p_from, p_to], s) to the interval [q_from, q_to] of 
              this new array ; 
         } 
         else { 
            - add CONTRIB([p_from, p_to], s) to interval [q_from, q_to] of the already 
              existing array with attribute OC_p ; 
         } 

       - retrieve description DE_p from the record R for p ; 
       - if (DE_p has not been encountered before) { 
            - create a one-dimensional score array with length |Q| ; 
            - initialize the array to all 0's and set DE_p as its attribute ; 
            - assign CONTRIB([p_from, p_to], s) to the interval [q_from, q_to] of 
              this new array ; 
         } 
         else { 
            - add CONTRIB([p_from, p_to], s) to interval [q_from, q_to] of the already 
              existing array with attribute DE_p ; 
         } 

       - from the record R, retrieve all features FT_p that overlap with the 
         instance [p_from, p_to] of s in the containing sequence ; 
       - determine the interval of intersection [i_from, i_to] of each annotated 
         region in R with the instance [p_from, p_to] of s ; 
       - for each feature f in FT_p with non-zero intersection [i_from, i_to] { 
            if (f has not been encountered before) { 
               - create a one-dimensional score array with length |Q| ; 
               - initialize the array to all 0's and set f as its attribute ; 
               - assign CONTRIB([p_from, p_to], s) to the interval 
                 [q_from + (i_from - p_from), q_from + (i_to - p_from)] of this new array ; 
            } 
            else { 
               - add CONTRIB([p_from, p_to], s) to the interval 
                 [q_from + (i_from - p_from), q_from + (i_to - p_from)] of the already existing 
                 array with attribute f ; 
            } 
         } 
   } 

   2d) OPTIONAL STEP - repeat this process for other useful information in record R ; 

   2e) repeat steps 2b) through 2d) for seqlet s and all other available databases 
       that contain useful and/or complementary attribute information ; 
} 

3) for each of the result categories (e.g. OC, DE, FT, etc.), normalize the scores, 
   rank all score arrays, and finally report the top-ranking attributes in 
   the category ; 

Figure 2. Pseudo-code showing the computational steps of the method. 



• there exists a total of F sequence fragments corresponding
to all of the instances of the M seqlets in the processed input
database; clearly, these fragments will be similar to the amino
acid neighborhood surrounding this query position;
• the F sequence fragments in the database will agree on the
amino acid identity of the literals (recall that the seqlets
contain both literals and wild-cards, i.e. 'dots') contained in
each of the M seqlets.

These database sequence fragments may or may not agree
on the annotation of the query position under consideration. If
the annotations for N of these F database sequence fragments
state that this site is a metal-binding site then through
application of the 'guilty by association' approach, our belief
that this query position is also a metal-binding site will be
proportional to N/F. This very idea is applied to every attribute
and attribute category that is attached to a seqlet. The direct
implication of this is that a seqlet can be useful and able to
contribute to a specific annotation without having to span an
attribute (e.g. protein kinase domain) in its entirety. Moreover,
the seqlet does not have to be a predicate for an attribute
either.

Figure 3. Accumulating seqlet contributions. Seqlets do not have to span a feature in its
entirety in order to corroborate an attribute. Nor do they have to be
specific enough to act as predicates for the attribute. See also text for more details.

Figure 3 graphically depicts this situation. For discussion
purposes, let us assume that when we searched the query Q
with the Bio-Dictionary, we determined that seqlet K is present
in Q in the region [q_from, q_to]. Let us also assume that during
the Bio-Dictionary formation, seqlet K was determined to have
three instances in the processed SwissProt/TrEMBL database.
After following these three backpointers to the full entries of
the sequences that contain the three instances of seqlet K, we
determine that in one of the sequences the seqlet instance
spans an interval [p_from, p_to] that has a non-empty intersection
with a specific region [feat_from, feat_to] of that sequence that is
annotated as NP_BIND ATP, i.e. as an ATP-binding site. Let
[i_from, i_to] denote the intersection of the intervals [p_from, p_to]
and [feat_from, feat_to]. In this example, seqlet K will corroborate
the presence of a partial ATP-binding domain in the query that
is being annotated by incrementing the support at the locations
[q_from + (i_from - p_from), q_from + (i_to - p_from)] of the NP_BIND ATP
attribute vector.
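The interval arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `add_feature_support`, the use of 0-based inclusive coordinates, and the fixed increment `amount` are hypothetical choices made here and are not taken from the paper.

```python
def add_feature_support(vector, q_from, p_from, p_to, feat_from, feat_to, amount=1.0):
    """Add `amount` to the query positions that a partial feature overlap maps to.

    All coordinates are 0-based and inclusive (an assumption of this sketch).
    Returns the (start, end) query interval that was incremented, or None
    when the seqlet instance and the annotated feature do not overlap.
    """
    # Intersection [i_from, i_to] of the seqlet instance and the feature.
    i_from = max(p_from, feat_from)
    i_to = min(p_to, feat_to)
    if i_from > i_to:
        return None
    # Shift the intersection from database coordinates to query coordinates.
    start = q_from + (i_from - p_from)
    end = q_from + (i_to - p_from)
    for pos in range(start, end + 1):
        vector[pos] += amount
    return (start, end)


# Seqlet instance at query position 10 and database positions 100..111;
# the database feature covers positions 105..120.
support = [0.0] * 30
span = add_feature_support(support, q_from=10, p_from=100, p_to=111,
                           feat_from=105, feat_to=120)
```

Here only the part of the seqlet instance that overlaps the feature (database positions 105 to 111) adds support, and it lands on the corresponding query positions.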

It should now be clear why any given seqlet does not have to
serve as a predicate for the attribute(s) that it corroborates. The
term 'attribute' is overloaded in our discussion and should be
interpreted rather loosely: it can mean a local similarity, a
global similarity, an active site, a phylogenetic domain, a
post-translationally modified location, etc.

If the query being annotated contains a true instance of a
given attribute, then each one of the numerous seqlets that will
cover the region spanned by the attribute more than once will
cumulatively and independently provide support for the
attribute at the respective positions: as the accumulated
support for the attribute increases so does the likelihood of
its presence in the query.

If the query is a true member of a known protein family,
then we expect the attribute vector for this family to obtain
support along its entire length from practically every single
one of the seqlets that match in this query. If a query contains a
known domain, then the attribute vector for the domain will
have non-zero support over the region of the query that
corresponds to the domain's instance. In an analogous manner,
if the processed query shares only a local region with a
well-characterized family or an individual protein, then the
corresponding attribute vector will have non-zero support
only over the shared region.

The manner in which we use the seqlets and accumulate
scores has proven particularly useful in situations that, among
others, include the following: the query is a fragment of a
known sequence; the query contains one or more known
domains in a novel order; the query has been assembled using
an incorrect exon collection (e.g. one or more true exons are
missing, introns have been mislabeled as exons and included
in the assembly, exons that correspond to distinct genes have
been assembled together, etc.).

Moreover, the fact that our seqlets have lengths that 
typically span between 6 and 18 amino acids (for a detailed 
discussion see 26) permits us to easily and correctly process 
very short input queries, e.g. 8-10 amino acids, without the 
thresholding constraints and limitations that one typically 
encounters when using heuristics-based similarity search 
algorithms (3,4). 

In real-world applications, situations arise where the query
represents or contains only a fragment from a known domain,
for example a query involving the first few tens of amino acids
from say a 'protein kinase domain'. In order to alert the user to
this situation, we also include, wherever applicable, the
'minimum', 'average' and 'standard deviation' values for the
span of each of the T top ranking reported attribute vectors.
This permits easy determination of whether the query
represents a complete instance of the stated attribute or only
a fragment.

The method: scoring

How much to contribute. Above we described how we
determine the extent of an attribute vector region to which a
seqlet matching the query will contribute. We now discuss
how we determine the amount that the seqlet will contribute.

Let seqlet K be present in query Q and let q_{i_1}q_{i_2}q_{i_3}...q_{i_l} and
p_{j_1}p_{j_2}p_{j_3}...p_{j_l} be its instances in the query and in some database
sequence d, respectively; let {i_1, ..., i_l} and {j_1, ..., j_l} denote
the indices of the positions spanned by the seqlet in the query
Q and sequence d, respectively. For simplicity, we will assume
that the instance of seqlet K completely spans an annotated
region of d that corresponds to an attribute A.

Seqlet K brings together two sequence fragments with
lengths equal to the span of the seqlet; one fragment comes
from the query that is being analyzed while the other is from
the sequence d of the database. Obviously, the more similar
these two fragments are the more likely it is that upon
completion of the annotation process the attribute A that is
associated with the database region p_{j_1}p_{j_2}p_{j_3}...p_{j_l} will be
carried over to the query region q_{i_1}q_{i_2}q_{i_3}...q_{i_l} through the
'guilty by association' approach. There is a rather
straightforward manner in which seqlet K can contribute to the vector
for attribute A: we simply use one of the available scoring
matrices and generate contributions in a position- and
content-dependent manner as follows:

for m = 1 to l { attribute_vector[i_1 + m - 1]
+= f(scoring_matrix[q_{i_1+m-1}][p_{j_1+m-1}]) }

(the symbol += is shorthand notation for 'increment by
amount shown on the right of the sign'). In other words, the
seqlet will contribute to the (i_1 + m - 1)th position of the
attribute vector an amount that relates to the degree of
similarity between the amino acids occupying the positions
q_{i_1+m-1} and p_{j_1+m-1}, respectively. A good choice for the
function f(.) above is f(x) = 2^x + constant; with regard to the
scoring matrix to use one can employ any of the standard PAM
or BLOSUM scoring matrices (35,36).
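A minimal sketch of this update loop follows, written in Python with 0-based indices. The function `toy_score` is a hypothetical stand-in for a real PAM or BLOSUM lookup (identical residues score 2, all others score -1), and the names `f` and `contribute` are illustrative only; the form f(x) = 2^x + constant follows the text above.

```python
def toy_score(a, b):
    # Hypothetical stand-in for a PAM/BLOSUM matrix lookup.
    return 2 if a == b else -1


def f(x, constant=0.0):
    # Contribution function of the form suggested in the text: 2^x + constant.
    return 2.0 ** x + constant


def contribute(attribute_vector, query, i1, db_seq, j1, length):
    """Add one seqlet instance's contribution to an attribute vector.

    `i1` and `j1` are the 0-based start positions of the seqlet instance in
    the query and in the database sequence; `length` is the seqlet span.
    """
    for m in range(length):
        q_res = query[i1 + m]
        p_res = db_seq[j1 + m]
        attribute_vector[i1 + m] += f(toy_score(q_res, p_res))


vector = [0.0] * 6
# Query fragment "KAVL" (positions 1..4) paired with database fragment "KAIL".
contribute(vector, "MKAVLT", 1, "QQKAILQ", 2, 4)
```

Positions where the query and database residues agree receive a larger increment than the mismatching position, which is the position- and content-dependent behavior described above.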

In order to avoid the over-counting that would be the
consequence of a given protein family or fragment being
over-represented in the SwissProt/TrEMBL database, we impose
the additional constraint that a given seqlet cannot contribute
to the same attribute vector and vector position(s) more than
once. In other words, if seqlet K captures a very well-conserved
region appearing in a large number of SwissProt/TrEMBL
sequences, only one of the seqlet's numerous annotated
database instances will contribute to the respective attribute
vector.

How to normalize. As mentioned already, a given seqlet with
distinct possible meanings will contribute in turn to each of the
attribute vectors that correspond to those meanings. And these
contributions will depend on how well a known database
instance of the attribute matches its alleged instance in the
query. Different attribute vectors will accumulate different
amounts of contribution and these contributions will also
depend on the position within the attribute vector.

During the annotation of the query we maintain a
book-keeping array, total_contrib, with length equal to that of the
query; for every seqlet with an instance q_{i_1}q_{i_2}q_{i_3}...q_{i_l} in the
query, we update total_contrib as follows:



for m = 1 to l { total_contrib[i_1 + m - 1]
+= f(scoring_matrix[q_{i_1+m-1}][p_{j_1+m-1}]) }

In other words, the ith position of total_contrib is a measure
of the number of seqlets that contribute to it, with each
contribution weighted by the degree of similarity between the
amino acids of the query and their input database counterparts.
The function f(.) is the same as in the previous section. Note
that at all times during processing, the value of total_contrib[i]
is greater than or equal to the maximum value one will
encounter at the ith position of any of the active attribute
vectors for the query.

Once all of the seqlets matching the query have been
examined, we normalize the contents of the ith position of
each attribute vector by dividing by the value of
total_contrib[i]. Multiplying this normalized value by 100 gives us,
for each attribute vector, a measure of the fraction of the total
contribution that this attribute has received, as a function of
position within the query. Well-conserved attributes will have
values close to 100% whereas less conserved attributes will
have fewer seqlets contributing to them and thus will have
smaller values. Note that this particular way of normalizing
has the additional property of alleviating the situation where
equal length regions of the query receive disproportionately
different contributions due to differences in the number of
contributing seqlets: this normalization will permit all regions
in the query to have equally 'strong voices'.
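The normalization step can be illustrated with a short Python sketch. The function name `normalize`, the dictionary layout of the attribute vectors, and the attribute names in the example are assumptions made for illustration; positions that received no contribution at all are reported as 0 here to avoid dividing by zero, a detail the text does not specify.

```python
def normalize(attribute_vectors, total_contrib):
    """Express each attribute vector as a percentage of total_contrib.

    `attribute_vectors` maps an attribute name to a list of accumulated
    contributions, one entry per query position.
    """
    normalized = {}
    for name, vector in attribute_vectors.items():
        normalized[name] = [
            100.0 * value / total if total > 0 else 0.0
            for value, total in zip(vector, total_contrib)
        ]
    return normalized


# Two hypothetical attributes competing for the same three query positions.
vectors = {"ATP-binding": [4.0, 2.0, 0.0], "kinase domain": [4.0, 6.0, 0.0]}
total_contrib = [8.0, 8.0, 0.0]
percentages = normalize(vectors, total_contrib)
```

Because every position is divided by its own total, a region matched by few seqlets is scored on the same 0 to 100 scale as a region matched by many.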
Figure 4. Some results from processing the human ubiquitin UBIQ_HUMAN by our method.

Figure 5. Additional results from processing the human ubiquitin UBIQ_HUMAN by our method.



How to rank. Once the contents of the attribute vectors have
been normalized, we sort them based on their received
contributions and report the top T of them. We have
implemented a scheme that will rank a narrow, well-conserved
region higher than a wider region which is not as well
conserved. This permits us to report attributes such as
well-conserved active sites or post-translationally modified sites
among the top ranking positions of the results. Finally, when
we report 'local similarities' we further require of the attribute
vectors pertaining to similarities that any set of consecutive
non-zero positions be at least X positions wide; the value of X
is user-defined and typical values range in the interval [10, 20].
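The width requirement on local similarities can be sketched as a simple filter over an attribute vector. This Python fragment covers only that filter; the ranking scheme itself is not specified in enough detail in the text to reproduce, and the function name `keep_wide_runs` is a hypothetical choice.

```python
def keep_wide_runs(vector, min_width):
    """Zero out runs of consecutive non-zero positions shorter than min_width."""
    filtered = [0.0] * len(vector)
    run_start = None
    # A trailing 0 acts as a sentinel so that a run ending at the last
    # position is closed like any other run.
    for pos, value in enumerate(list(vector) + [0]):
        if value != 0 and run_start is None:
            run_start = pos
        elif value == 0 and run_start is not None:
            if pos - run_start >= min_width:
                filtered[run_start:pos] = vector[run_start:pos]
            run_start = None
    return filtered


# The run of width 2 is dropped; the run of width 4 survives (X = 3 here).
kept = keep_wide_runs([0, 1, 1, 0, 2, 2, 2, 2, 0], min_width=3)
```

With the typical values quoted above, `min_width` would be set somewhere between 10 and 20.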

The method: how to find matches in the query

In order to efficiently implement the above method we need to
be able to quickly determine which of the Bio-Dictionary
seqlets match where within the query. A simplistic approach
would require that, for the ~43 000 000 seqlets and every
single query position, we check whether there is a match; this
would of course be very slow. The problem of identifying such
matches is complicated by the presence of wild-cards ('don't
care characters') in the seqlets that we use.

To deal with this situation, we have designed and
implemented a novel and very efficient method for solving precisely
this problem: our method makes use of a very efficient hashing
scheme that subselects among the Bio-Dictionary seqlets prior
to using the ones that survive in conjunction with a modified
version of the Aho-Corasick algorithm (37). The resulting
scheme permits us to fully annotate a 300 amino acid protein
in ~10 s on a single IBM RS64III processor running at a clock
speed of 450 MHz. The description of this matching algorithm
extends beyond the scope of this presentation and will be
given elsewhere (M.Lewenstein, T.Huynh and I.Rigoutsos, in
preparation).
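For reference, the simplistic approach mentioned above (testing a seqlet at every query position) can be written as follows. This is deliberately the slow baseline and not the hashing and Aho-Corasick scheme, whose details are not given here; the function name `find_matches` and the example seqlet are hypothetical.

```python
def find_matches(seqlet, query):
    """Return the 0-based start positions where `seqlet` matches `query`.

    A '.' in the seqlet is a wild-card that matches any amino acid; every
    other character must match the query residue exactly.
    """
    hits = []
    span = len(seqlet)
    for start in range(len(query) - span + 1):
        window = query[start:start + span]
        if all(s == "." or s == q for s, q in zip(seqlet, window)):
            hits.append(start)
    return hits


# The seqlet "A..K" requires an A followed three positions later by a K.
positions = find_matches("A..K", "MAGTKAQQK")
```

Repeating this scan for tens of millions of seqlets is what makes the simplistic approach impractical and motivates a pre-filtering step.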
RESULTS

We next showcase the capabilities of our approach by
annotating a carefully selected collection of example queries
and discussing the results obtained. All of the results we report
in this section can be reproduced using the Web-based
implementation of our method, available at
http://cbcsrv.watson.ibm.com/Tpa.html. The underlined text in the figures
generated by the Web-based tool is in fact hyperlinks which
permit the user to issue a search request to SwissProt/TrEMBL
and retrieve all of the database entries with the property stated
by the text. This capability is meant to facilitate
cross-comparisons and verification of the reported results.
Moreover, upon completion of an annotation, the user can
view the Bio-Dictionary patterns that matched within the
query, as well as each pattern's estimated log probability and
the actual position within the query where the match begins.

Example 1. UBIQ_HUMAN

As our first example, we examine the annotation of the 76
amino acid human ubiquitin, UBIQ_HUMAN. Some of the
results of the analysis are shown in Figures 4-6. As can be
seen from these figures, the SwissProt/TrEMBL database
contains enough information for our method to correctly
determine the secondary structure of the fragment: notice the
localization of the helices, strands and turns and their
interweaving in Figure 4.
Figure 6. More results from processing the human ubiquitin UBIQ_HUMAN by our method.

Figure 7. Some of the results obtained from processing the fragment VVVTAHAF with our method.

Figure 8. Partial results from processing the adrenocorticotropic hormone receptor protein ACTR_BOVIN by our method.




It is not always the case that there will be enough
information in SwissProt/TrEMBL for us to be able to make
statements about the local secondary structure of a query. This
limitation can be alleviated in one of two ways: (i) we can rely
on SwissProt/TrEMBL's continuing augmentation and
updates; as the database becomes bigger and more enriched, our
capability to annotate local structure will also improve; (ii) we
can make use of the information in the PDB database in the
manner that we have described (26,27). The seqlets' meanings
will be enriched by incorporating structural information from
the much more comprehensive PDB; we are currently in the
process of augmenting our annotation method so that it will
include this component. Finally, note how our method
correctly determines the nature and position of seven sites
that are relevant to the function of ubiquitin as well as the
presence and extent of the ubiquitin domain.

Example 2. A very short fragment

For our second example, we have selected the 8 amino acid
fragment VVVTAHAF, a fragment that is too short to be used
with heuristics-based similarity search algorithms such as
FASTA and BLAST/PSI-BLAST. As shown in Figure 7,
when we process this fragment with our method we can
correctly determine that: (i) it is an amino acid combination
encountered only in the eukaryotic domain; (ii) it belongs to a
cytochrome c oxidase; (iii) it is part of a transmembrane
domain; and (iv) it has a metal-binding site (iron) at the sixth
position from the beginning, i.e. at the position of the histidine.
Figure 9. Results of RPS-BLAST and PSI-BLAST using the sequence UL78_HCMVA as an input. Default parameter settings were used.



Example 3. ACTR_BOVIN

Another capability of our system is the determination of
cytoplasmic, transmembrane and extracellular regions in a
given query. We showcase this using ACTR_BOVIN, an
adrenocorticotropic hormone receptor protein from Bos taurus.
Figure 8 shows the plots for the cytoplasmic and extracellular
behavior of the query as determined by our method: note that
the regions of the query that are not accounted for by these two
plots correspond precisely to the seven transmembrane domains
of the ACTR_BOVIN (the corresponding transmembrane plots
are not included in this figure).

Example 4. UL78_HCMVA

The next example is a sequence that comes from the human
herpesvirus 5 (39). In particular, it is the 431 amino acid
sequence UL78_HCMVA. In Figure 9 we show the output of
both RPS-BLAST and PSI-BLAST (38) on this specific query
sequence: as can be seen, the only detectable similarity is with
the rhodopsin family and is confined in the region [60, 170];
no other similarities can be determined outside this region (the
PSI-BLAST hit at the second position is with an
uncharacterized sequence from the Tupaia herpesvirus and thus is not
informative). One possible interpretation is that there is no
Figure 10. Partial results from processing the sequence UL78_HCMVA from human herpesvirus 5.

Table 2. Best GPCRDB/SwissProt hits when using UL78_HCMVA regions as the query (see also text)

See text for more details.



Putative TM Helix #1
UL78_HCMVA   45  GMTCSVStVWLLniGC  61
                 G+F+S* IV IL + C
PE21_HUMAN  205  CT,TAgT.CT,v»T.™iT.vr  221

Putative TM Helix #2
UL78_HCMVA   74  VKIFrWHLVLSQPPSIlATJCLS  95
                 + IF-H? LV4-K* «► +L
EBI2_HUMAN  151  VC YFVWTr.VFflOTLPIX  167

Putative TM Helix #3
UL78_HCMVA  112  VLFVDDVGLVSXMr^rLFLIL  132
                 LYS ALP-*LPL
PSAB_ANTMA  135  ArefrVKEfftSFX  145

Putative TM Helix #4
UL78_HCMVA  154  GVALYAVAFAWVLSXVAAVPT  174
                 A+ +VAP AVP
OPSD_RANCA  152  AKMCVArwrMaEACAYE  170

Putative TM Helix #5
UL78_HCMVA  202  WWPXLGAPHJAVLAIJVYCLAYS  222
                 I + PZil£ P+ AV+
P2Y3_MELGA       it> m aWMfflVY  43

Putative TM Helix #6
UL78_HCMVA  236  VCTPYVTCLHtFVPYrCPRVL  256
                 VCT ++ FVP++
PAFR_MOUSE  2^4  VgPVIAVPTTCrtfTKH  249

Putative TM Helix #7
UL78_HCMVA  280  TRTLZ/TMRI^IJLPLFIIAFPS  300
                 T R IX, +FX* FT
OPSD_TODPA  1    tTRSKTl^KFTT^FP  209

single sequence in the SwissProt/TrEMBL database that
resembles UL78_HCMVA (other than the query itself and
its Tupaia herpesvirus counterpart), either in terms of the order
of any domains that may be present or in terms of its
composition.

When we process UL78_HCMVA with our method, we
discover weak similarities that relate UL78_HCMVA mainly
to hypothetical proteins in a manner that is similar to what is
shown in Figure 9. However, further inspection of our results
provides us with enough information to appropriately
categorize the query. In Figure 10 we show the plot for the
query's transmembrane behavior as reported by our method:
seven very distinct regions are immediately apparent thus
permitting us to conjecture that this sequence is a G
protein-coupled receptor homolog. The seven regions correspond to
the intervals 45-61, 74-95, 112-132, 154-174, 202-222,
236-256 and 280-300, respectively, and have well-delineated
boundaries. Notably, a similarity search in the GPCRDB
database using UL78_HCMVA as the query currently
generates no hits to known families.

In Table 2 we show the alignment for the best ranking hits
obtained when we search the GPCRDB database subset that
contains only SwissProt entries (but not TrEMBL) using each
of the seven putative transmembrane regions as a query. If the
UL78_HCMVA putative transmembrane regions correspond
to transmembrane helices of a GPCR homolog one will expect
to see them matching known transmembrane regions from
sequences in GPCRDB. This is indeed the case, as this table
shows: the regions of the GPCRDB/SwissProt hits that are
labeled as transmembrane regions of a G protein-coupled
receptor are shown underlined bold next to each of the seven
queries. The only exception is the query corresponding to the
putative TM helix 3, where the best match is to a
transmembrane region from PSAB_ANTMA, a photosystem
protein. In all seven cases the quality of the conservation is
notable. It should also be noted that several sequences other
than UL78_HCMVA were reported as GPCR homologs when
the analysis of the complete genome of the human herpesvirus
5 was first published (39).

Example 5. Comparison with the annotations of recently 
published/updated genomes 

We next showcase the capability of our approach by 
processing the complete genomes of three organisms whose 
sequences were published after 14 May 2001 and compare our 
results with the annotations that accompanied the release of 
the respective genomes. Since the Bio-Dictionary that we used 
for our experiments was built using the SwissProt/TrEMBL 
release of 14 May 2001, the results from these comparisons are
indicative of our method's ability to extrapolate and annotate 
novel sequences. Additionally, we annotated two genomes 
whose sequences were released into the public domain prior to 
14 May 2001. Obviously, the sequences of these organisms are 
already contained in the input database from which we built 
the Bio-Dictionary used in these annotations. However, the 
GenBank database entries for these genomes and the respec- 
tive annotations were updated several months after 14 May 
2001: it is these more recent annotations that we use for our 
comparative study and not the annotations that accompanied 
the original genome submission. The purpose behind these 
comparisons is to determine the extent of agreement between 
our predictions and the original annotators' updated
predictions when using a sequence database that has been
substantially augmented since the genomes under consideration were
originally deposited.

It should be stressed that any such comparisons can only 
provide estimates of what a user can expect when using our 
method to annotate a genome. Indeed, the very notion of an 
automated comparison of different annotation collections is, to 
a certain degree, ill-defined. The following observations will 
make this last statement clear. 

First, the published genomes are sequenced, annotated and 
released by different research groups which employed differ- 
ent automated tools in conjunction with generally distinct, 
although overlapping, knowledge bases of annotated bio- 
logical sequences. Once the automatically obtained gene 
annotations become available, they are typically curated 
manually during a 'genome annotation jamboree' by a
different team of scientists each time and using non-standard
nomenclature and abbreviations. As a result of this manual
curation, the annotations that accompany a newly published
genome contain much more than simply the result of applying
a 'guilty by association' automated approach. This last 
observation puts us at a distinct disadvantage when carrying 
out the comparisons that we report below. 

Independently of the annotation approach that is used, there
is always the issue of what it means to have 'annotated a
protein'. Even ignoring disagreements in the annotations of
individual proteins, several levels of detail are possible when
making an annotation. As an example, Table 3 shows valid,
non-contradicting annotations for a fictitious protein: the thing
to notice here is the different amount of information that is
conveyed by each annotation statement. Ideally, one seeks the
most detailed description possible for the available knowledge
base. The possibility of different levels of annotation detail
adds an extra degree of difficulty and can result in annotation
disagreements when lists of annotations that have been
reported by different groups at different points in time are
automatically compared with one another (13,14,40-45).

Table 3. Annotations for a fictitious protein that are non-conflicting with
one another but correspond to varying degrees of conveyed annotation
detail

Non-conflicting annotations for a fictitious protein

Cellular process protein
Membrane protein
Integral membrane protein
Protein involved in cellular signaling
G protein-coupled receptor
Secretin-like protein
Corticotropin releasing factor

Even if one ignores the above difficulties, differences can
still arise as a result of using different guidelines and criteria
each time, thus leading to substantial variations in the claimed
percentage of genes that can be annotated in a newly
sequenced genome based on sequence similarity with known
proteins. Generally speaking, the current state of the art
permits one to report functional hypotheses for ~70% of the
predicted genes in a given prokaryotic genome (43-51). The
fraction for eukaryotic genomes is typically much lower,
although in the case of specific eukaryotic chromosomes,
notable exceptions exist (52).

In light of the above observations, we decided to generate
our figures by manually comparing, for each and every one of
the involved genes, the annotations reported during the release
of the genome with those generated by our method. The results
are given in Table 4. The first three genomes, namely
Rickettsia conorii Malish 7 (53), Staphylococcus aureus Mu50
(54) and Streptococcus pneumoniae TIGR4 (55) were
published and made available in the Fall of 2001. The last two
genomes, namely Chlamydia pneumoniae J138 (56) and
Buchnera sp. APS (57) were published in June and September
2000, respectively, but their GenBank records were updated in
the Fall of 2001. For each genome we report the number and
percentage of genes that fall into each of the following
categories: (i) the latest GenBank annotation and our
annotation agree; (ii) the GenBank annotation contains a
'hypothetical protein' entry whereas our system proposes a
functional hypothesis; (iii) the GenBank annotation lists a
functional hypothesis whereas our system reports a
'hypothetical protein'; and (iv) the GenBank annotation and our
annotation disagree.
As shown in Table 4, for the two genomes that were updated
recently, the agreement between our automated predictions
and the latest GenBank annotations reaches a level of 98%
over the entire genome. It should be noted that this figure also
includes those genes for which there is no functional
hypothesis (i.e. they are listed as 'hypothetical proteins').
For the three novel genomes, the agreement between the
predictions ranges between 88 and 92%. It is worth reiterating
that the annotations that are included in the GenBank entries
for the various genomes are the result of manually curating the
output of multiple automated tools, whereas our scheme
generates annotations in an entirely automated manner using a
single unified framework. In recent collaborative work with
colleagues from several European laboratories the complete
genome of Chlamydia trachomatis serovar D (58) was
re-annotated using (i) manual means, (ii) traditional automated
tools and (iii) our method. As described in detail (I.Iliopoulos,
S.Tsoka, M.A.Andrade, A.J.Enright, M.Carroll, P.Poullet,
V.Promponas, T.Liakopoulos, G.Palaios, C.Pasquier,
S.Hamodrakas et al., submitted for publication), the
annotations that were obtained through manual means and through
our Bio-Dictionary-based method achieved the best overall
performance reaching an annotation agreement on 862 of the
893 processed sequences, i.e. 96.5% of the entire genome. Of
the remaining 31 sequences, 13 could be annotated manually
but could not be annotated by our method, whereas the other
18 could be annotated with our method but could not be
annotated manually.

Table 4. Results from manually comparing our predictions with the annotations that have been reported for several genomes

Column key:
(A) Latest GenBank annot. = B-D annot. (hypothetical proteins included) [% (no. of genes)]
(B) Latest GenBank annot. = hypothetical protein & B-D annot. = functional hypothesis [% (no. of genes)]
(C) Latest GenBank annot. = functional hypothesis & B-D annot. = hypothetical protein [% (no. of genes)]
(D) Latest GenBank annot. ≠ B-D annot. (hypothetical proteins not included) [% (no. of genes)]

Genome name          Latest GenBank    No. of predicted  (A)            (B)          (C)         (D)
                     annotation date   genes
R.conorii Malish 7   3 Oct 2001        1374              88.94% (1222)  7.06% (97)   2.04% (28)  1.96% (27)
S.aureus Mu50        4 Oct 2001        2748              91.85% (2524)  7.28% (200)  0.18% (5)   0.69% (19)
S.pneumoniae TIGR4   3 Oct 2001        2094              87.87% (1840)  4.25% (89)   2.63% (55)  5.25% (110)
C.pneumoniae J138    2 Oct 2001        1089              98.41% (1052)  0.04%        0.05% (6)   0.07% (7)
Buchnera sp. APS     10 Sep 2001       564               97.69% (551)   1.24%        0.01% (1)   1.06% (6)

The first three of the genomes listed are novel in that they were published several months after we built the Bio-Dictionary used to generate functional
hypotheses. The remaining two genomes were published in 2000, but their GenBank entries were updated in the Fall of 2001. As can be seen, our system's
output matches in quality the annotations that have been made available after manual curation of automated analysis. See main text for details.

Example 6. Annotations on the World Wide Web

Similarly to the previous example, we have annotated the
sequences of more than 70 complete genomes across the three
phylogenetic domains, including: Methanococcus jannaschii
DSM 2661 (59), Halobacterium sp. NRC-1 (60), Sulfolobus
solfataricus P2 (61), Mycoplasma genitalium G-37 (62),
Synechocystis sp. PCC 6803 (63), Escherichia coli
K12-MG1655 (64), Helicobacter pylori 26695 (65), Borrelia
burgdorferi B31 (66), Aquifex aeolicus VF5 (67),
Mycobacterium tuberculosis H37Rv (68), Chlamydia trachomatis
serovar D (58), Chlamydophila pneumoniae CWL029 (69),
Thermotoga maritima MSB8 (70), Deinococcus radiodurans
R1 (71), Yersinia pestis CO92 (72), Saccharomyces cerevisiae
S288C (73), Caenorhabditis elegans (74), Drosophila
melanogaster (75), Homo sapiens (76,77) and Mus musculus. The
annotations of these genomes are available on the World Wide
Web and can be viewed and interactively explored by visiting
http://cbcsrv.watson.ibm.com/Annotations/.

The system that we make available oo the World Wide Web 
provides the user with several options. Within a specific 
genome, if the accession number of a gene is known, then it 
can be used to locate and view the annotation of the gene. 



Alternative^ , one can search the results in the DE and FT 
attribute cate gories of the genome using regular expressions 
that can be altered with the help of a graphical user interface. 
For example, when run against the DE results, the regular 
expression 

^[1-3]].*calcium.*bind



will locate and report all the sequences in the genome under
consideration that 'share any similarities with calcium binding
sequences and are ranked in the top three positions'.



Analogously, when run against the FT results, the regular
expression



^[1-9]].*domain.*bh[1234].*

will permit the user to search for sequences that 'contain one
or more of the cell apoptosis-associated domains BH1, BH2,
BH3 and BH4 and are ranked in the top nine positions'. To list
the three top ranking functional hypotheses for each gene in a
genome, one can use the regular expression

^[1-3]]

to search through the DE results. At http://cbcsrv.watson.
ibm.com/Help/ShowMeHowToSearch.html the user can find
information on how to form these regular expressions and the
permitted keywords, as well as several specific examples with
explanations.
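
To make the behavior of these expressions concrete, the following is a minimal sketch of how they might be applied to result lines. It assumes that each result line starts with its rank followed by a closing bracket, that the expressions are anchored at the start of the line, and that matching ignores case. None of these details is stated in the text, and the sample lines and the `search` helper are hypothetical.

```python
import re

# Hypothetical result lines of the form "<rank>] <description>".
# The real layout of the DE/FT result files is an assumption here.
de_results = [
    "1] calcium-binding protein",
    "2] elongation factor tu",
    "4] calcium binding mitochondrial carrier",
    "3] probable calcium-transporting atpase, binds atp",
]
ft_results = [
    "5] domain bh3",
    "7] domain sh3",
]

def search(lines, pattern):
    """Return the lines matched by a case-insensitive regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if rx.search(line)]

# "]" is escaped for clarity; outside a character class it is literal anyway.
top3_calcium = search(de_results, r"^[1-3]\].*calcium.*bind")
top3_all = search(de_results, r"^[1-3]\]")
bh_hits = search(ft_results, r"^[1-9]\].*domain.*bh[1234].*")

print(top3_calcium)
print(top3_all)
print(bh_hits)
```

Under these assumptions the first expression keeps only the calcium-related entries ranked 1 to 3, the second keeps every entry ranked 1 to 3, and the third keeps the entry that mentions a BH domain.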

Additionally, we have enabled and made available cross- 
genomic comparisons/searches: through a graphical user 
interface, one or more genomes can be selected and their
annotations searched for similarities with a specific family
(e.g. elongation factor, tRNA-aminoacyl synthetase, etc.) or
the presence of a specific feature (e.g. hydrogen bond donor,
calcium-binding domain, helix-turn-helix, etc.) with the help
of regular expressions similar to those used to analyze 
individual genomes. 



DISCUSSION AND FUTURE DIRECTIONS

In this paper, we have presented and discussed a new method 
for the automated annotation of amino acid sequences. The 
method quickly, objectively and exhaustively determines local
and global similarities between a given query and any protein 
already present in a public database, the likeness of the query 
to all available archaeal/bacterial/eukaryotic/viral sequences
in the database as a function of amino acid position within the 
query, the secondary structure character of the query as a 
function of amino acid position within the query, the 
cytoplasmic, transmembrane or extracellular behavior of the 
query, the nature and position of binding domains, active sites, 
post-translationally modified sites and signal peptides, etc.

The key concept underlying our method is that of the Bio- 
Dictionary, which we presented and discussed in earlier work. 
By design, the presented method is extendable and can make 
use of any type of attribute that would be of interest to the end 
user. It can also make use of multiple databases. 

Through a carefully selected collection of examples, we 
have demonstrated the capabilities of our method and the 
quality of the annotations that it generates. Our system 
automatically generates results whose quality matches that of 
publicly available annotations; recall that such annotations are 
typically the product of a manual curation that has followed 
the application of automated processes. In terms of actual 
annotation speed, our system can annotate a 300 amino acid 
query in ~10 s on a single IBM RS64III processor running at a
clock speed of 450 MHz. 

We are currently in the process of enhancing our system
with several new components. One extension involves the 
automatic determination and reporting of all the PubMed 
references pertaining to the query sequence that is annotated. 
For each of the reported results in the DE category we will be 
making available links to all PubMed articles that are relevant 
for the study of the query sequence and the family described 
by the caption. This is currently work in progress. 

A second extension, which we have already described 
above, involves the automated generation of local 3D structure 
through 'meanings' that are derived from the contents of the 
PDB database. This is also work in progress. 

Finally, an important topic that we will be studying relates 
to the fact that the SwissProt/TrEMBL database has up to now
used non-standardized nomenclature to label database entries. 
For example, the following are some of the DE lines that are 
associated with aldose reductases: 

aldose reductase (ec 1.1.1.21) (ar) (aldehyde reductase)
aldose reductase
alcohol dehydrogenase [nadp+] (ec 1.1.1.2) (aldehyde reductase).

When our system is presented with an aldose reductase as a 
query, e.g. ALDR_HUMAN, then multiple attribute vectors
will be reported, one for each of these seemingly distinct (but 
in reality identical) attributes. A planned future release of our 
system will alleviate this problem through the use of 
standardized names. 
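
One way such a standardization step could work is sketched below. This is purely illustrative: the text does not say how the planned release will standardize names, so the synonym table, the `merge_by_standard_name` helper and the placeholder payload values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym table mapping raw DE captions to one standard name.
# It treats the three DE lines quoted above as the same attribute.
STANDARD_NAME = {
    "aldose reductase (ec 1.1.1.21) (ar) (aldehyde reductase)": "aldose reductase",
    "aldose reductase": "aldose reductase",
    "alcohol dehydrogenase [nadp+] (ec 1.1.1.2) (aldehyde reductase)": "aldose reductase",
}

def merge_by_standard_name(hits):
    """Group (caption, payload) hits under their standardized name.

    Captions missing from the table are kept unchanged.
    """
    merged = defaultdict(list)
    for caption, payload in hits:
        name = STANDARD_NAME.get(caption.strip().lower(), caption)
        merged[name].append(payload)
    return dict(merged)

# Placeholder payloads stand in for the reported attribute vectors.
hits = [
    ("aldose reductase (ec 1.1.1.21) (ar) (aldehyde reductase)", "vector-a"),
    ("aldose reductase", "vector-b"),
    ("alcohol dehydrogenase [nadp+] (ec 1.1.1.2) (aldehyde reductase)", "vector-c"),
]
merged = merge_by_standard_name(hits)
print(merged)
```

With a table like this, the three seemingly distinct captions collapse into a single entry, so one attribute would be reported instead of three.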